ABSTRACT

Privacy is an increasingly important aspect of data publishing. Rea- 
soning about privacy, however, is fraught with pitfalls. One of 
the most significant is the auxiliary information (also called exter- 
nal knowledge, background knowledge, or side information) that 
an adversary gleans from other channels such as the web, public 
records, or domain knowledge. This paper explores how one can 
reason about privacy in the face of rich, realistic sources of auxil- 
iary information. Specifically, we investigate the effectiveness of 
current anonymization schemes in preserving privacy when mul- 
tiple organizations independently release anonymized data about 
overlapping populations. 

1. We investigate composition attacks, in which an adversary uses
independent anonymized releases to breach privacy. We explain 
why recently proposed models of limited auxiliary information 
fail to capture composition attacks. Our experiments demon- 
strate that even a simple instance of a composition attack can 
breach privacy in practice, for a large class of currently proposed
techniques. The class includes k-anonymity and several
recent variants.

2. On a more positive note, certain randomization-based notions 
of privacy (such as differential privacy) provably resist compo- 
sition attacks and, in fact, the use of arbitrary side information. 
This resistance enables "stand-alone" design of anonymization 
schemes, without the need for explicitly keeping track of other 
releases. We provide a precise formulation of this property, and 
prove that an important class of relaxations of differential pri- 
vacy also satisfy the property. This significantly enlarges the 
class of protocols known to enable modular design. 

1. INTRODUCTION 

Privacy is an increasingly important aspect of data publishing. 
The potential social benefits of analyzing large collections of per- 
sonal information (census data, medical records, social networks) 
are significant. At the same time, the release of information from 
such repositories can be devastating to the privacy of individuals or 
organizations [5]. The challenge is therefore to discover and release
the global characteristics of these databases without compromising
the privacy of the individuals whose data they contain. 

Reasoning about privacy, however, is fraught with pitfalls. One 
of the most significant difficulties is the auxiliary information (also 
called external knowledge, background knowledge, or side infor- 
mation) that an adversary gleans from other channels such as the 
web or public records. For example, simply removing obviously
identifying information such as names and addresses does not suffice
to protect privacy, since the remaining information (such as
zip code, gender and date of birth [30]) may still identify a person
uniquely when combined with auxiliary information (such as
voter registration records). Schemes that resist such linkage have
been the focus of extensive investigation, starting with work on
publishing contingency tables [1], and more recently, in a line of
techniques based on "k-anonymity" [30].

This paper explores how one can reason about privacy in the 
face of rich, realistic sources of auxiliary information. This follows
lines of work in both the data mining [26, 27, 9] and cryptography
[10, 12] communities that have sought principled ways
to incorporate unknown auxiliary information into anonymization 
schemes. Specifically, we investigate the effectiveness of current 
anonymization schemes in preserving privacy when multiple or- 
ganizations independently release anonymized data about overlap- 
ping populations. We show new attacks on some schemes and also 
deepen the current understanding of schemes known to resist such 
attacks. Our results and their relation to previous work are dis- 
cussed below. 

Schemes that retain privacy guarantees in the presence of inde- 
pendent releases are said to compose securely. The terminology, 
borrowed from cryptography (which borrowed, in turn, from soft- 
ware engineering), stems from the fact that schemes which com- 
pose securely can be designed in a stand-alone fashion without 
explicitly taking other releases into account. Thus, understanding 
independent releases is essential for enabling modular design. In 
fact, one would like schemes that compose securely not only with 
independent instances of themselves, but with arbitrary external 
knowledge. We discuss both types of compositions in this paper. 

The dual problem to designing schemes with good composition 
properties is the design of attacks that exploit such information. We 
call these composition attacks. A simple example of such an attack,
in which two hospitals with overlapping patient populations pub- 
lish anonymized medical data, is presented below. Composition 
attacks highlight a realistic and important class of vulnerabilities. 
As privacy-preserving data publishing becomes more commonly
deployed, it is increasingly difficult to keep track of all the organizations
that publish anonymized summaries involving a given individual
or entity, and schemes that are vulnerable to composition
attacks will become increasingly difficult to use safely.



1.1 Contributions 

Our contributions are summarized briefly in the abstract, above, 
and discussed in more detail in the following subsections. 

1.1.1 Composition Attacks on Partition-based Schemes
We introduce composition attacks and study their effect on a 
popular class of partitioning-based anonymization schemes. Very 
roughly, computer scientists have worked on two broad classes of 
anonymization techniques. Randomization-based schemes introduce
uncertainty either by randomly perturbing the raw data (a
technique called input perturbation, randomized response, e.g.,
[34, 2, 16], or post-randomization, e.g., [32]), or by injecting randomness
into the algorithm used to analyze the data (e.g., [6, 28]).
Partition-based schemes cluster the individuals in the database into
disjoint groups satisfying certain criteria (for example, in k-anonymity
[30], each group must have size at least k). For each group,
certain exact statistics are calculated and published. Partition-based
schemes include k-anonymity [30] as well as several recent variants,
e.g., [26, 23, 36, 27, 9].

Because they release exact information, partition-based schemes 
seem especially vulnerable to composition attacks. In the first part 
of this paper we study a simple instance of a composition attack 
called an intersection attack. We observe that the specific proper- 
ties of current anonymization schemes make this attack possible, 
and we evaluate its success empirically. 

Example. Suppose two hospitals H1 and H2 in the same city release
anonymized patient-discharge information. Because they are
in the same city, some patients may visit both hospitals with similar
ailments. Tables 1(a) and 1(b) represent (hypothetical) independent
k-anonymizations of the discharge data from H1 and H2
using k = 4 and k = 6, respectively. The sensitive attribute here
is the patient's medical condition. It is left untouched. The other
attributes, deemed non-sensitive, are generalized (that is, replaced
with aggregate values), so that within each group of rows, the vectors
of non-sensitive attributes are identical. If Alice's employer
knows that she is 28 years old, lives in zip code 13012 and recently
visited both hospitals, then he can attempt to locate her in
both anonymized tables. Alice matches four potential records in
H1's data, and six potential records in H2's. However, the only
disease that appears in both matching lists is AIDS, and so Alice's
employer learns the reason for her visit.

Intersection Attacks. The above example relies on two proper- 
ties of the partition-based anonymization schemes: (i) Exact sensi- 
tive value disclosure: the "sensitive" value corresponding to each 
member of the group is published exactly; and (ii) Locatability: 
given any individual's non-sensitive values (non-sensitive values
are exactly those that are assumed to be obtainable from other, public
information sources), one can locate the group in which the individual
has been placed. Based on these properties, an adversary can
narrow down the set of possible sensitive values for an individual 
by intersecting the sets of sensitive values present in his/her groups 
from multiple anonymized releases. 
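
The two properties can be illustrated with a minimal Python sketch. The rows, the attribute names (`age`, `zip`, `condition`) and the generalization format below are hypothetical stand-ins chosen for illustration; they are not the contents of Table 1. `matches` plays the role of locatability, and `candidate_values` relies on exact sensitive value disclosure.

```python
def matches(row, age, zip_code):
    """Locatability: does a generalized row cover the target's non-sensitive values?"""
    lo, hi = row["age"]
    prefix = row["zip"].rstrip("*")
    return lo <= age <= hi and zip_code.startswith(prefix)

def candidate_values(release, age, zip_code):
    """Exact disclosure: sensitive values of all rows in the target's group."""
    return {row["condition"] for row in release if matches(row, age, zip_code)}

# Hypothetical releases from two organizations (illustrative data only).
release_1 = [
    {"age": (20, 29), "zip": "130**", "condition": "AIDS"},
    {"age": (20, 29), "zip": "130**", "condition": "Flu"},
    {"age": (30, 39), "zip": "130**", "condition": "Cancer"},
    {"age": (30, 39), "zip": "130**", "condition": "Flu"},
]
release_2 = [
    {"age": (25, 34), "zip": "1301*", "condition": "AIDS"},
    {"age": (25, 34), "zip": "1301*", "condition": "Cancer"},
    {"age": (25, 34), "zip": "1301*", "condition": "Diabetes"},
]

# The target is 28 years old and lives in zip code 13012.
survivors = candidate_values(release_1, 28, "13012") & candidate_values(release_2, 28, "13012")
print(survivors)  # {'AIDS'}
```

Each release on its own leaves the adversary with two or three candidate conditions; only the intersection isolates a single value.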

Properties (i) and (ii) turn out to be widespread. The exact dis- 
closure of sensitive value lists is a design feature common to all 
the schemes based on k-anonymity: preserving the exact distribution
of sensitive values is important, and so no recoding is usually
applied. Locatability is less universal, since it depends on the ex- 
act choice of clustering algorithm (used to form groups) and the 
recoding applied to the non-sensitive attributes. However, some 
schemes always satisfy locatability by virtue of their structure (e.g., 
schemes that recursively partition the data set along the lines of a 
hierarchy that is subsequently used for generalization [21, 22]). For
other schemes, locatability is not perfect, but our experiments suggest
that simple heuristics can locate an individual's group
with high probability.

Table 1: A simple example of a composition attack. Tables (a) and (b) are 4-anonymous
(respectively, 6-anonymous) patient data from two hypothetical
hospitals. If Alice's employer knows that she is 28, lives in zip code
13012 and visits both hospitals, he learns that she has AIDS.

Even with these properties, it is difficult to come up with a theo- 
retical model for intersection attacks because the partitioning tech- 
niques generally create dependencies that are hard to model an- 
alytically. However, if the sensitive values of the members of a 
group could be assumed to be statistically independent of their 
non-sensitive attribute values, then a simple birthday-paradox-style 
analysis would yield reasonable bounds. 
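
To give a feel for what such an independence assumption implies, the following Python sketch estimates by simulation how often an intersection isolates the target's value. It is only an illustration, not the analysis alluded to above: the uniform distribution over a domain of `m` sensitive values, the function name and all parameters are assumptions introduced here.

```python
import random

def estimate_perfect_breach(k1, k2, m, trials=2000, seed=0):
    """Estimate P(intersection == {target's value}) for groups of size k1 and k2.

    Assumption (ours): every non-target group member's sensitive value is drawn
    independently and uniformly from a domain of m values.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        target_value = rng.randrange(m)
        # The target's value is always present in both of the target's groups.
        group_1 = {target_value} | {rng.randrange(m) for _ in range(k1 - 1)}
        group_2 = {target_value} | {rng.randrange(m) for _ in range(k2 - 1)}
        if (group_1 & group_2) == {target_value}:
            hits += 1
    return hits / trials

for k in (2, 5, 10):
    print(k, estimate_perfect_breach(k, k, m=50))
```

Under these assumptions the estimate falls as the group sizes grow relative to the domain size, in the manner of a birthday-style collision argument.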

Experimental Results. Instead, we evaluated the success of in- 
tersection attacks empirically. We ran the intersection attack on 
two popular census databases anonymized using partition-based 
schemes. We evaluated the severity of such an attack by measuring
the number of individuals who had their sensitive value revealed.
Our experimental results confirm that partitioning-based
anonymization schemes including k-anonymity and its recent variants,
ℓ-diversity and t-closeness, are indeed vulnerable to intersection
attacks. Section 3 elaborates our methodology and results.

Related Work on Modeling Background Knowledge. It is im- 
portant to point out that the partition-based schemes in the litera- 
ture were not designed to be used in contexts where independent 
releases are available. Thus, we do not view our results as pointing 
out a flaw in these schemes, but rather as directing the community's 
attention to an important direction for future work. 

It is equally important to highlight the progress that has already 
been made on modeling sophisticated background knowledge in 
partition-based schemes. One line has focused on taking into ac- 
count other, known releases, such as previous publications by the 
same organization ("sequential" releases, [33, 7, 36]) and multiple
views of the same data set [37]. Another line has considered incorporating
knowledge of the clustering algorithm used to group individuals
[35]. Most relevant to this paper are works that have sought
to model unknown background knowledge. Martin et al. [27] and
Chen et al. [9] provide complexity measures for an adversary's side
information (roughly, they measure the size of the smallest formula 
within a CNF-like class that can encode the side information). Both 
works design schemes that provably resist attacks based on side in- 
formation whose complexity is below a given threshold. 

Independent releases (and hence composition attacks) fall out- 
side the models proposed by these works. The sequential release 
models do not fit because they assume the other releases are
known to the anonymization algorithm. The complexity-based mea- 
sures do not fit because independent releases appear to have com- 
plexity that is linear in the size of the data set. 

1.1.2 Composing Randomization-based Schemes 

Composition attacks appear to be difficult to reason about, and it 
is not initially clear whether it is possible at all to design schemes 
that resist such attacks. Even defining composition properties pre- 
cisely is tricky in the presence of malicious behavior (for example, 
see [24] for a recent survey about composability of cryptographic
protocols). Nevertheless, a significant family of anonymization
definitions does provide guarantees against composition attacks, namely
schemes that satisfy differential privacy [14]. Recent work has
greatly expanded the applicability of differential privacy and its
relaxations, both in the theoretical [15, 6, 14, 4, 28] and applied
[17, 3, 25] literature. However, certain recently developed techniques
such as sampling [8], instance-based noise addition [29] and data
synthesis [25] appear to require relaxations of the definition.

It is simple to prove that both the strict and relaxed variants of 
differential privacy compose well (see [13, 29, 28]). Less trivially,
however, one can prove that strictly differentially-private algorithms
also provide meaningful privacy in the presence of arbitrary
side information (Dwork and McSherry, [12]). In particular,
these schemes compose well even with completely different
anonymization schemes. 

It is natural to ask if there are weaker definitions which provide 
similar guarantees. Certainly not all of them do: one natural re- 
laxation of differential privacy, which replaces the multiplicative 
distance used in differential privacy with total variation distance, 
fails completely to protect privacy (see example 2 in [14]).

In this paper, we prove that two important relaxations of differ- 
ential privacy do, indeed, resist arbitrary side information. First, 
we provide a Bayesian formulation of differential privacy which 
makes its resistance to arbitrary side information explicit. Second, 
we prove that the relaxed definitions of [13, 25] still imply the
Bayesian formulation. The proof is non-trivial, and relies on the
"continuity" of Bayes' rule with respect to certain distance measures
on probability distributions. Our result means that the recent
techniques mentioned above [13, 8, 29, 25] can be used modularly
with the same sort of assurances as in the case of strictly
differentially-private algorithms. 

2. PARTITION-BASED SCHEMES 

Let D be a multiset of tuples where each tuple corresponds to 
an individual in the database. Let R be an anonymized version 
of D. From this point on, we use the terms tuple and individual 
interchangeably, unless the context leads to ambiguity. Let
$A = \{A_1, A_2, \ldots, A_r\}$ be a collection of attributes and t be a tuple in R;
we use the notation $t[A]$ to denote $(t[A_1], \ldots, t[A_r])$, where each
$t[A_i]$ denotes the value of attribute $A_i$ in table R for t.

In partitioning-based anonymization approaches, there exists a 
division of data attributes into two classes, sensitive attributes and 
non-sensitive attributes. A sensitive attribute is one whose value 
and an individual's association with that value should not be disclosed.
All attributes other than the sensitive attributes are
non-sensitive attributes.

Definition 1 (Quasi-identifier). A set of non-sensitive 
attributes $\{Q_1, \ldots, Q_r\}$ is called a quasi-identifier if there is at
least one individual in the original sensitive database D who can 
be uniquely identified by linking these attributes with auxiliary data. 

Previous work in this line typically assumed that all the attributes 
in the database other than the sensitive attribute form the quasi- 
identifier. 

Definition 2 (Equivalence Class). An equivalence class
for a table R with respect to attributes in A is the set of all tuples
$t_1, t_2, \ldots, t_i \in R$ for which the projection of each tuple onto
attributes in A is the same, i.e., $t_1[A] = t_2[A] = \cdots = t_i[A]$.

Partition-based schemes cluster individuals into groups, and then 
recode (i.e., generalize or change) the non-sensitive values so that 
each group forms an equivalence class with respect to the quasi- 
identifiers. Sensitive values are not recoded. Different criteria are 
used to decide how, exactly, the groups should be structured. The 
most common rule is k-anonymity, which requires that each equivalence
class contain at least k individuals.

Definition 3 (k-anonymity). A release R is k-anonymous
if for every tuple $t \in R$, there exist at least $k - 1$ other tuples
$t_1, t_2, \ldots, t_{k-1} \in R$ such that $t[A] = t_1[A] = \cdots = t_{k-1}[A]$ for
every collection A of attributes in the quasi-identifier.
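
The k-anonymity condition can be checked mechanically. The sketch below is a minimal illustration that assumes a release is represented as a list of dictionaries; the attribute names and rows are hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())

# Hypothetical generalized release with two equivalence classes of size 2.
rows = [
    {"age": "20-29", "zip": "130**", "condition": "Flu"},
    {"age": "20-29", "zip": "130**", "condition": "AIDS"},
    {"age": "30-39", "zip": "130**", "condition": "Cancer"},
    {"age": "30-39", "zip": "130**", "condition": "Flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], 2))  # True
print(is_k_anonymous(rows, ["age", "zip"], 3))  # False
```

Note that the check looks only at the quasi-identifier columns; the sensitive column plays no role in the definition.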

In our experiments we also consider two extensions to k-anonymity.

Definition 4 (Entropy ℓ-diversity). For an equivalence
class E, let S denote the domain of the sensitive attributes, and let
$p(E, s)$ be the fraction of records in E that have sensitive value s.
Then E is ℓ-diverse if:

$$-\sum_{s \in S} p(E, s) \log p(E, s) \ge \log \ell.$$

A table is ℓ-diverse if all its equivalence classes are ℓ-diverse.
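
As a worked illustration of the entropy condition, the following Python sketch evaluates it for a single equivalence class. The function names and sample values are hypothetical, and the small numerical tolerance is an implementation choice made here, not part of the definition.

```python
import math
from collections import Counter

def entropy(sensitive_values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    n = len(sensitive_values)
    return -sum((c / n) * math.log(c / n) for c in Counter(sensitive_values).values())

def is_entropy_l_diverse(sensitive_values, l):
    """Check -sum_s p(E,s) log p(E,s) >= log l for one equivalence class E."""
    return entropy(sensitive_values) >= math.log(l) - 1e-12  # tolerance for rounding

# Fractions 1/2, 1/4, 1/4: entropy is about 1.04, above log 2 (about 0.69).
print(is_entropy_l_diverse(["Flu", "AIDS", "Cancer", "Flu"], 2))  # True
# Fractions 3/4, 1/4: entropy is about 0.56, below log 2.
print(is_entropy_l_diverse(["Flu", "Flu", "Flu", "AIDS"], 2))     # False
```

The second class contains two distinct values yet fails for ℓ = 2, because the entropy condition also penalizes skew.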

Definition 5 (t-closeness). An equivalence class E is t-close
if the distance between the distribution of a sensitive attribute
in this class and distribution of the attribute in the whole table is 
no more than a threshold t. A table is t-close if all its equivalence 
classes are t-close. 
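
The definition leaves the distance measure open. The sketch below assumes total variation distance between the empirical distributions purely for illustration; other distances can be substituted without changing the structure of the check. All names and data are hypothetical.

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of sensitive values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def total_variation(p, q):
    """Half the L1 distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

def is_t_close(class_values, table_values, t):
    """Assumption: distance = total variation; the definition does not fix it."""
    return total_variation(distribution(class_values), distribution(table_values)) <= t

table = ["Flu", "Flu", "AIDS", "Cancer"]
print(is_t_close(["Flu", "AIDS"], table, 0.3))       # True  (distance 0.25)
print(is_t_close(["Cancer", "Cancer"], table, 0.3))  # False (distance 0.75)
```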

Locatability. As mentioned in the introduction, many anonymiza- 
tion algorithms satisfy locatability, that is, they output tables in 
which one can locate an individual's group based only on his or her 
non-sensitive values. 

Definition 6 (Locatability). Let Q be the set of quasi-identifier
values of an individual in the original database D. Given
the k-anonymized release R of D, the locatability property allows
an adversary to identify the set of tuples $\{t_1, \ldots, t_K\}$ in R (where
$K \ge k$) that correspond to Q.

Locatability does not necessarily hold for all partition-based schemes,
since it depends on the exact choice of clustering algorithm
(used to form groups) and the recoding applied to the non-sensitive
attributes. However, it is widespread. Some schemes always satisfy
locatability by virtue of their structure (e.g., schemes that recursively
partition the data set along the lines of a hierarchy always
provide locatability if the attributes are then generalized using the
same hierarchy, or if (min, max) summaries are used [21, 22]). For
other schemes, locatability is not perfect, but our experiments suggest
that simple heuristics can locate a person's group with
good probability. For example, microaggregation [11] clusters individuals
based on Euclidean distance. The vectors of non-sensitive
values in each group are replaced by the centroid (i.e., average)
of the vectors. The simplest heuristic for locating an individual's
group is to choose the group with the closest centroid vector. In
experiments on census data, this correctly located approximately
70% of individuals. In our attacks, we always assume locatability.
This assumption was also made in previous studies [30, 27].
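
The closest-centroid heuristic for microaggregated releases can be sketched as follows. The centroid values and the two numeric attributes are hypothetical, and the sketch assumes all non-sensitive attributes are numeric and on comparable scales.

```python
import math

def nearest_group(point, centroids):
    """Return the index of the centroid closest to `point` in Euclidean distance."""
    return min(range(len(centroids)), key=lambda g: math.dist(point, centroids[g]))

# Hypothetical published centroids over (age, years of education).
centroids = [(25.0, 12.0), (41.5, 16.0), (63.0, 10.0)]
print(nearest_group((28, 13), centroids))  # 0
```

In practice attributes would likely need rescaling before distances are compared, since the heuristic is sensitive to units.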

2.1 Intersection Attack 

Armed with these basic definitions, we now proceed to formalize 
the intersection attack (Algorithm 1). 



Algorithm 1 Intersection attack

R_1, ..., R_n <- n independent anonymized releases
P <- set of overlapping population
for each individual i in P do
    for j = 1 to n do
        e_ij <- Get_equivalence_class(R_j, i)
        s_ij <- Sensitive_value_set(e_ij)
    end for
    S_i <- s_i1 ∩ s_i2 ∩ ... ∩ s_in
end for
return S_1, ..., S_|P|



Let $R_1, \ldots, R_n$ be n independent anonymized releases with minimum
partition-sizes of $k_1, \ldots, k_n$, respectively. Let P be the
overlapping population occurring in all the releases. The function 
Get_equivalence_class returns the equivalence class into which an 
individual falls in a given anonymized release. The function Sen- 
sitive_value_set returns the set of (distinct) sensitive values for the 
members in a given equivalence class. 
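
A minimal Python rendering of Algorithm 1 is given below. The representation of a release as a pair of dictionaries (individual to class id, class id to published sensitive values) is an assumption made for this sketch, with locatability taken for granted; the names and data are hypothetical.

```python
def intersection_attack(releases, population):
    """Sketch of Algorithm 1.

    `releases` is a list of (class_of, values_of) pairs, a representation
    assumed here for illustration: class_of maps an individual to an
    equivalence class id, and values_of maps a class id to the sensitive
    values published for that class.
    """
    result = {}
    for i in population:
        survivors = None
        for class_of, values_of in releases:
            e_ij = class_of[i]            # Get_equivalence_class
            s_ij = set(values_of[e_ij])   # Sensitive_value_set
            survivors = s_ij if survivors is None else survivors & s_ij
        result[i] = survivors
    return result

release_1 = ({"alice": "g1", "bob": "g1"},
             {"g1": ["AIDS", "Flu", "Flu", "Cancer"]})
release_2 = ({"alice": "h1", "bob": "h2"},
             {"h1": ["AIDS", "Ulcer"], "h2": ["Flu", "Cancer", "Gout"]})
attack = intersection_attack([release_1, release_2], ["alice", "bob"])
print(attack)
```

Here the first individual is left with a single candidate value, while the second retains two.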

Definition 7 (Anonymity). For each individual i in P, 
the anonymity factor promised by each release Rj is equal to the 
corresponding minimum partition-size kj. 

However, as pointed out in [26], the actual anonymity offered is
less than this ideal value and is equal to the number of distinct sensitive
values in each equivalence class. We call this the effective anonymity.

Definition 8 (Effective Anonymity). For an individual
i in P, the effective anonymity offered by a release $R_j$ is equal to
the number of distinct sensitive values of the partition into which
the individual falls. Let $e_{ij}$ be the equivalence class or partition
into which i falls with respect to the release $R_j$, and let
$s_{ij}$ denote the sensitive value set for $e_{ij}$. The effective anonymity
for i with respect to the release $R_j$ is: $EA_{ij} = |s_{ij}|$.

For each target individual i, $EA_{ij}$ is the effective prior anonymity
with respect to $R_j$ (anonymity before the intersection attack). In the
intersection attack, the list of possible sensitive values associated
with the target is equal to the intersection of all sensitive value sets $s_{ij}$,
$j = 1, \ldots, n$. So the effective posterior anonymity ($EA_i$) for i is:

$$EA_i = \Bigl| \bigcap_{j=1}^{n} s_{ij} \Bigr|.$$

The difference between the effective prior anonymity and effective
posterior anonymity quantifies the drop in effective anonymity.

$$\mathrm{Anon\_Drop}_i = \min_{j = 1, \ldots, n} \{EA_{ij}\} - EA_i.$$



The vulnerable population (VP) is the set of individuals
(among the overlapping population) for whom the intersection attack
leads to a positive drop in the effective anonymity.

$$VP = \{ i \in P : \mathrm{Anon\_Drop}_i > 0 \}.$$

After performing the sensitive value set intersection, the adversary
knows only a possible set of values that each individual's sensitive
attribute can take. So, the adversary deduces that with equal
probability (under the assumption that the adversary does not have
any further auxiliary information) the individual's actual sensitive
value is one of the values in the set $\bigcap_{j=1}^{n} s_{ij}$. So, the
adversary's confidence level for an individual i can be defined as:

Definition 9 (Confidence level $C_i$). For each individual
i, the confidence level $C_i$ of the adversary in identifying the
individual's true sensitive value through the intersection attack is
defined as $C_i = 1 / EA_i$.

Now, given some confidence level C, we denote by $VP_C$ and
$PVP_C$ the set and the percentage of overlapping individuals for
whom the adversary can deduce the sensitive attribute value with a
confidence level of at least C.

$$VP_C = \{ i \in P : C_i \ge C \},$$
$$PVP_C = \frac{|VP_C|}{|P|} \cdot 100.$$
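
The quantities defined in this section can be computed together, as in the sketch below. It takes the confidence level to be the reciprocal of the effective posterior anonymity, which is what the equal-probability argument above implies. The input format (a dictionary from individual to the list of that individual's sensitive value sets, one per release) and the sample data are assumptions made for illustration.

```python
def attack_metrics(sensitive_sets, confidence):
    """Compute EA_i, Anon_Drop_i, VP, VP_C and PVP_C.

    `sensitive_sets[i]` is the list [s_i1, ..., s_in] of Python sets for
    individual i. Every intersection is assumed non-empty, since the
    individual's true value appears in each of their groups.
    """
    ea_post, drop, conf = {}, {}, {}
    for i, sets in sensitive_sets.items():
        ea_prior = [len(s) for s in sets]            # EA_ij = |s_ij|
        ea_post[i] = len(set.intersection(*sets))    # EA_i
        drop[i] = min(ea_prior) - ea_post[i]         # Anon_Drop_i
        conf[i] = 1.0 / ea_post[i]                   # C_i = 1 / EA_i
    vp = {i for i in drop if drop[i] > 0}
    vp_c = {i for i in conf if conf[i] >= confidence}
    pvp_c = 100.0 * len(vp_c) / len(sensitive_sets)
    return ea_post, drop, vp, vp_c, pvp_c

sets = {
    "alice": [{"AIDS", "Flu", "Cancer"}, {"AIDS", "Ulcer"}],
    "bob": [{"Flu", "Cancer"}, {"Flu", "Cancer", "Gout"}],
}
ea_post, drop, vp, vp_c, pvp_c = attack_metrics(sets, confidence=1.0)
print(ea_post, drop, vp, vp_c, pvp_c)
```

With a required confidence of 1.0, the percentage returned corresponds to individuals whose sensitive value is pinned down exactly.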

3. EXPERIMENTAL RESULTS 

In this section we describe our experimental study.¹ The primary
goal is to quantify the severity of such an attack on existing
schemes. Although the earlier works address problems with k-anonymization
and adversarial background knowledge, to the best
of our knowledge, none of these studies deal with attacks result- 
ing from auxiliary independent releases. Furthermore, none of the 
studies so far have quantified the severity of such an attack. 

3.1 Setup 

We use three different partitioning-based anonymization techniques
to demonstrate the intersection attack: k-anonymity, ℓ-diversity,
and t-closeness. For k-anonymity, we use the Mondrian multidimensional
approach proposed in [21] and the microaggregation
technique proposed in [11]. For ℓ-diversity and t-closeness, we
use the definitions of entropy ℓ-diversity and t-closeness proposed
in [26] and [23], respectively.

We use two census-based databases from the UCI Machine Learning
repository [31]. The first one is the Adult database that has been
used extensively in the k-anonymity based studies. The database
was prepared in a similar manner to previous studies [21, 26] (also
explained in Table 2). The resulting database contained individual
records corresponding to 30162 people. The second database is
the IPUMS database that contains individual information from the
1997 census studies. We only use a subset of the attributes that are
similar to the attributes present in the Adult database to maintain
uniformity and to maintain quasi-identifiers. The IPUMS database
contains individual records corresponding to a total of 70187 people.
This data set was prepared as explained in Table 3.

From both the Adult and IPUMS databases, we generate two overlapping
subsets (Subset 1 and Subset 2) by randomly sampling individuals
without replacement from the total population. We fixed
the overlap size to |P| = 5000. For each of the databases, the two
subsets are anonymized independently and the intersection attack
is run on the anonymization results. All the experiments were run
on a Pentium 4 system running Windows XP with 1GB RAM.

¹ The code, parameter settings, and complete results are made available
at: http://www.cse.psu.edu/~ranjit/kdd08

Table 2: Description of the Adult census database.
Table 3: Description of the IPUMS census database.
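
One way to construct two subsets with a fixed overlap, as described in the setup above, is sketched below. The text fixes only the overlap size; the number of individuals unique to each subset (`extra_per_subset`), the seed and the function name are hypothetical parameters introduced for this illustration.

```python
import random

def overlapping_subsets(population, overlap_size, extra_per_subset, seed=0):
    """Draw two subsets that share exactly `overlap_size` individuals.

    `extra_per_subset` (individuals unique to each subset) is a hypothetical
    parameter; the text fixes only the overlap size.
    """
    rng = random.Random(seed)
    # Sample without replacement, then carve the draw into three disjoint parts.
    drawn = rng.sample(population, overlap_size + 2 * extra_per_subset)
    overlap = drawn[:overlap_size]
    only_1 = drawn[overlap_size:overlap_size + extra_per_subset]
    only_2 = drawn[overlap_size + extra_per_subset:]
    return overlap + only_1, overlap + only_2, overlap

# Record ids 0..30161 stand in for the 30162 Adult records.
subset_1, subset_2, overlap = overlapping_subsets(list(range(30162)), 5000, 10000)
print(len(subset_1), len(subset_2), len(set(subset_1) & set(subset_2)))  # 15000 15000 5000
```

Each subset would then be anonymized independently before the intersection attack is run on the shared individuals.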

3.2 Severity of the Attack 

Our first goal is to quantify the extent of damage possible through 
the intersection attack. For this, we consider two possible situa- 
tions: (i) Perfect breach and (ii) Partial breach. 

3.2.1 Perfect Breach 

A perfect breach occurs when the adversary can deduce the exact 
sensitive value of an individual. In other words, a perfect breach 
is when the adversary has a confidence level of 100% about the 
individual's sensitive data. To estimate the probability of a perfect 
breach, we compute the percentage of overlapping population for 
whom the intersection attack leads to a final sensitive value set of 
size 1. Figure 1 plots this result.

We consider three scenarios for anonymizing the two overlapping
subsets: (i) Mondrian on both the data subsets, (ii) microaggregation
on both the data subsets, and (iii) Mondrian on the first
subset and microaggregation on the second subset. Here, $(k_1, k_2)$ represents
the pair of k values used to anonymize the first and the second
subset, respectively. In the experiments, we use the same k values
for both the subsets ($k_1 = k_2$). Note that for simplicity, from now
on we will be defining confidence level in terms of percentages.

In the case of the Adult database, we found that around 12% of the
population is vulnerable to a perfect breach for $k_1 = k_2 = 5$. For
the IPUMS database, the figure is much more severe, around 60%.
As the degree of anonymization increases, or in other words, as the
value of k increases, the percentage of vulnerable population goes
down. The reason is that as the value of k increases, the
partition sizes in each subset increase. This leads to a larger intersection
set and thus a lower probability of obtaining an intersection
set of size 1.

3.2.2 Partial Breach 

Our next experiment aims to compute a more practical quantification
of the severity of the intersection attack. In most cases,
to inflict a privacy breach, all that the adversary needs to do is
narrow the possible sensitive values down to a few values, which
in itself could reveal a lot of information. For example, for a hospital
discharge database, narrowing the sensitive values of the
disease/diagnosis down to a few values, say, "Flu", "Fever", or "Cold",
could lead to the conclusion that the individual is suffering from a viral
infection. In this case, the adversary's confidence level is 1/3 ≈
33%. Figure 2 plots the percentage of vulnerable population for
whom the intersection attack leads to a partial breach for the Adult
and IPUMS databases.

Here, we only use the first anonymization scenario described
earlier, in which both the overlapping subsets of the database are
anonymized using the Mondrian multidimensional technique. Observe
that the severity of the attack increases alarmingly for a slight relaxation
of the required confidence level. For example, in the case
of the IPUMS database, around 95% of the population was vulnerable
for a confidence level of 25% for $k_1 = k_2 = 5$. For the Adult
database, although this value is not as alarming, more than 60% of
the population was affected.

3.3 Drop in Anonymity 

Our next goal is to measure the drop in anonymity occurring 
due to the intersection attack. To achieve this, we first take a closer
look at the way these schemes work. As described in the earlier 
sections, the basic paradigm in partitioning-based anonymization 
schemes is to partition the data such that each partition size is at 
least k. The methodology behind partitioning and then summariz- 
ing varies from scheme to scheme. The minimum partition-size 
(k) is thus used as a measure of the anonymity offered by these 
solutions. However, the effective (or true) anonymity supported by 
these solutions is far less than the presumed anonymity k (refer to 
the discussion in Section 2.1).

Figure 3 plots the average partition sizes and the average effective
anonymities for the overlapping population. Here again, we
only consider the scenario where both the overlapping subsets are
anonymized using the Mondrian multidimensional technique. Observe
that the effective anonymity is much less than the partition size for
both the data subsets. Also, note that these techniques result in
partition sizes that are much larger than the minimum required of
k. For example, the average partition size observed in the IPUMS
database for k = 5 is close to 40. To satisfy the k-anonymity definition,
there is no need for any partition to be larger than 2k + 1.
The reasoning is straightforward: splitting a partition
of size greater than 2k + 1 into two yields partitions of size at
least k. Additionally, splitting any partition of size 2k + 1 or more
only results in preserving more information. The culprit behind
the larger average partition sizes is generalization based on user-defined
hierarchies. Since generalization-based partitioning cannot
be controlled at finer levels, the resulting partition sizes tend to be
much larger than the minimum required value.

For each individual in the overlapping population, the effective 
prior anonymity is equal to the effective anonymity. We define 
the average effective prior anonymity with respect to a release as 
effective prior anonymities averaged over the individuals in the 
overlapping population. Similarly, the average effective posterior 
anonymity is the effective posterior anonymities averaged over the 
individuals in the overlapping population. The difference between 
the average effective prior anonymity and average effective posterior
anonymity gives the average drop in effective anonymity occurring
due to the intersection attack. Figure 4 plots the average effective
prior anonymities and the average effective posterior anonymities
for the overlapping population. Observe that the average effective
posterior anonymity is much less than the average effective
prior anonymity for both subsets. Also note that we measure drop 
in anonymities by using effective anonymities instead of presumed 
anonymities. The situation only gets worse (drops get larger) when 
presumed anonymities are used. 
Figure 1: Severity of the intersection attack - perfect breach (a) Adult database (b) IPUMS database.
Figure 2: Severity of the intersection attack - partial breach (a) Adult database (b) IPUMS database.
Figure 3: Comparison of presumed anonymity, actual partition sizes, and effective anonymity (a) Adult database (b) IPUMS database.



3.4 ℓ-diversity and t-closeness 

We now consider the ℓ-diversity and t-closeness extensions to 
the original k-anonymity definition. The goal again is to quantify 
the severity of the intersection attack by measuring the extent to 
which a partial breach occurs at varying adversary confidence 
levels. Figure 5 plots the percentage of the vulnerable population 
for whom the intersection attack leads to a partial breach for 
the Adult and IPUMS databases. Here, we anonymize both 
subsets of the database with the same definition of privacy. We use 
Mondrian multidimensional k-anonymity with the additional constraints 
defined by ℓ-diversity and t-closeness. Figure 5(a) plots 
the result for ℓ-diversity using the same ℓ value for both 
subsets (ℓ1 = ℓ2) and with k = 10. Figure 5(b) plots the same for 
t-closeness. Even though these extended definitions seem to perform 
better than the original k-anonymity definition, they still lead 
to a considerable breach in the case of an intersection attack. This result 
is fairly intuitive in the case of ℓ-diversity. Consider the definition 
of ℓ-diversity: the sensitive value set corresponding to each partition 
should be "well" (ℓ) diverse. However, there is no guarantee 
that the intersection of two well diverse sets is itself a well diverse 
set. t-closeness fares similarly. Also, both these definitions tend to 
force larger partition sizes, thus resulting in heavy information loss. 
Figure 6 plots the average partition sizes of the individuals in 
the overlapping population. It compares the partition 
sizes observed for k-anonymity, ℓ-diversity, and t-closeness. For 
the IPUMS database, with k = 10, k-anonymity produces 
partitions with an average size of 45, whereas for the 
same k = 10 combined with ℓ = 5, the average partition 
size obtained was close to 450. The partition sizes for t-closeness 
are even larger: a combination of k = 10 and t = 0.4 yields 
partitions of average size close to 1300. We can observe similar 
results for the Adult database. 

Figure 4: Average drop in effective anonymity due to the intersection attack (a) Adult database (b) IPUMS database. 

3.5 Role of Sensitive Attribute Domain 

In all of the above experiments we use "Occupation" (the occupation 
code of the individual) as the sensitive attribute for both the 
Adult and IPUMS databases, as shown in Tables 2 and 3. The domain 
size of the Occupation attribute in the Adult database is 14, 
whereas the domain size in the IPUMS database is 247. One 
plausible reason for the attack being more severe in the case 
of the IPUMS database is the size of the sensitive attribute domain. 
Because most partition sizes are much larger than 
the minimum required value k, in the case of the Adult database it 
is possible that the sensitive value set corresponding to every partition 
contains all the possible values in the domain. This implies 
that an intersection of two sensitive value sets results in a set of size 
close to the size of the domain. Thus, it is possible that the intersection 
attack will be less effective in cases where the sensitive attribute 
domain size is less than the average partition size. Intuitively, it 
seems that in cases where the sensitive attribute domain size 
is large (of the order of several hundreds) the intersection attack 
would be more severe. Also, most real-life databases have sensitive 
attributes with large domain sizes. For example, in 
a typical hospital discharge database, an ICD9 code is used to describe 
the diagnosis given to the patient. This code is a number 
from 1 to 999 [19] indicating the specific patient diagnosis. In other 
cases, the sensitive attribute domain sizes tend to be larger than this. 
The conjecture is that as the number of possible sensitive values 
increases, the intersection of two different sets results in a less 
diverse set. 

Sensitive Attribute   Domain Size   Diversity 
Occupation            247           4.30 
Industry              145           4.35 
Income                471           5.56 

Table 4: IPUMS database versions (non-sensitive attributes remain the same 
as the original) 

In order to confirm this, we constructed two new versions of the 
IPUMS database by replacing the sensitive attribute "Occupation" 
of each individual with "Industry", corresponding to the individual's 
work, and "Income", corresponding to the total income of the individual. 
The domain sizes of these attributes are summarized 
in Table 4: the domain size is 145 for the "Industry" attribute, 
247 for the original "Occupation" attribute, and 471 for "Income". 
We ran the intersection attack on 
these new versions of the IPUMS database and compared the results with the 
original. Figure 7 plots the average drop in effective anonymity for 
the overlapping population. Based on our conjecture, the drop in effective 
anonymity should increase with the sensitive 
attribute domain size. Surprisingly, we did not observe the trend we 
were expecting. The drop in effective anonymity for "Occupation" 
was smaller than for "Industry". It turns out 
that the number of values that actually occur 
for each sensitive attribute is not necessarily the same as the 
domain size, that is, the total number of possible values. 
So, a large sensitive attribute domain size does not guarantee that 
the number of values actually occurring is large. Instead, 
a simple entropy measure such as Shannon entropy can be 
used to measure the effective number of values that occur. The entropy 
value for each of these attributes is listed in Table 4. Although 
the domain size of the "Occupation" attribute is larger, its entropy 
is less than that of the "Industry" attribute. Now, 
the conjecture is that as the entropy (or information content) of the 
sensitive attribute increases, the severity of the intersection attack increases. 
Our result in Figure 7 confirms this. The average drop in 
effective anonymity increases with the entropy of the corresponding 
sensitive attribute, since the non-sensitive attributes are 
kept the same for all the datasets. 
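
The distinction between domain size and entropy can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the logarithm base used for Table 4 is not given in this section, so base 2 (bits) is assumed here, and the two toy columns are hypothetical rather than drawn from IPUMS.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits, an assumed unit) of the empirical distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical columns: 'skewed' has the larger domain (8 values) but one
# value dominates; 'uniform' has only 4 values, all equally frequent.
skewed = ["a"] * 93 + list("bcdefgh")
uniform = ["w", "x", "y", "z"] * 25

print(len(set(skewed)), shannon_entropy(skewed))
print(len(set(uniform)), shannon_entropy(uniform))
```

The column with the larger domain has the lower entropy, which mirrors the "Occupation" versus "Industry" observation above.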

3.6 Number of Databases 
Figure 5: Severity of the intersection attack - ℓ-diversity and t-closeness (a)(b) Adult Database (c)(d) IPUMS Database 
Figure 6: Average partition sizes for ℓ-diversity and t-closeness (a) Adult Database (b) IPUMS Database 



In the above experiments we have considered the scenario in 
which two anonymized releases contain information about overlapping 
populations. As data publishing becomes more prevalent 
among organizations that would like to share data for research and 
collaborative purposes, it is possible that more than two anonymized 
releases contain information about the same subset of 
people. The adversary could use as many 
anonymized releases as possible to gather information about a target 
population and use the intersection attack to deduce the sensitive 
attribute values. In such a scenario, it is interesting to see 
how the intersection attack performs in the presence of multiple (more 
than 2) overlapping anonymized releases. We first consider the percentage 
of vulnerable population with a confidence level of 100% 
(PVP100%). Figure 8(a) plots this for a varying number (n = 2, 3, 4) 
of anonymized releases available to the adversary. Here again, we build 
n overlapping subsets of the IPUMS database by fixing the overlapping 
population at 5000. It can be observed that the severity 
of the intersection attack increases with the 
number of anonymized releases available to the adversary. There is a 
significant increase in the percentage of vulnerable population with 
Figure 7: Effect of sensitive attribute domain size - IPUMS database. 



the increase in n, for small values of k. However, there seems to 
be no such significant increase for larger values of k. The reason 
is that the partition sizes for larger values of k tend to be 
large enough that additional anonymized releases 
no longer help the intersection attack. As an alternative 
to the severity of the attack, we can study the effect of the number 
of anonymized releases on the drop in effective anonymity. Figure 8(b) 
plots the average drop in effective anonymity for a varying 
number (n = 2, 3, 4) of anonymized releases. Here again we can 
observe that the drop in effective anonymity increases with the 
number of anonymized releases. These results indicate that 
if the anonymized releases correspond to fairly large values of k, 
the adversary gains only limited information by collecting 
additional releases. 
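
The multi-release attack described in this subsection can be sketched as follows. This is a hedged illustration, not the experimental code: it assumes each release exposes, for every individual, the set of sensitive values in that individual's partition, and it treats a perfect breach (confidence level 100%) as the case where the intersection across releases contains exactly one value. The names `sensitive_candidates` and `pvp_100` and the toy releases are hypothetical.

```python
def sensitive_candidates(releases, person):
    """Intersect the person's sensitive value sets across all releases."""
    return set.intersection(*(release[person] for release in releases))

def pvp_100(releases, population):
    """Percentage of the population whose sensitive value is pinned down exactly."""
    breached = sum(
        1 for person in population
        if len(sensitive_candidates(releases, person)) == 1
    )
    return 100.0 * breached / len(population)

# Hypothetical overlapping population p1..p4 seen in three releases.
population = ["p1", "p2", "p3", "p4"]
r1 = {"p1": {"A", "B", "C"}, "p2": {"A", "B", "C"}, "p3": {"B", "C", "D"}, "p4": {"B", "C", "D"}}
r2 = {"p1": {"A", "D"}, "p2": {"A", "B", "D"}, "p3": {"C", "D", "E"}, "p4": {"B", "C", "E"}}
r3 = {"p1": {"A", "E"}, "p2": {"B", "E"}, "p3": {"D", "A"}, "p4": {"B", "C", "A"}}

for n in (1, 2, 3):
    print(n, pvp_100([r1, r2, r3][:n], population))
```

Because intersecting with an additional set can never enlarge the candidate set, candidate sets only shrink as releases are added. Once partitions are large enough that every set already covers most of the domain, extra releases remove few candidates.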

4. DIFFERENTIAL PRIVACY 

In this section we give a precise formulation of "resistance to arbitrary 
side information" and show that several relaxations of differential 
privacy imply it. The formulation follows the ideas originally 
due to Dwork and McSherry, stated implicitly in [12]. This is, 
to our knowledge, the first place such a formulation appears explicitly. 
The proof that relaxed definitions (and hence the schemes of 
[13, 29, 25]) satisfy the Bayesian formulation is new. These results 
are explained in greater detail in a separate technical report [20]. 
In this paper we reproduce only the relevant parts from [20]. 

We represent databases as vectors in $\mathcal{D}^n$ for some domain $\mathcal{D}$ 
(for example, in the case of the relational databases above, $\mathcal{D}$ is the 
product of the attribute domains). There is no distinction between 
"sensitive" and "insensitive" information. Given a randomized algorithm 
A, we let A(D) be the random variable (or, probability 
distribution on outputs) corresponding to input D. 

Definition 10 (Differential Privacy). A randomized algorithm 
A is $\epsilon$-differentially private if for all databases $D_1, D_2 \in \mathcal{D}^n$ 
that differ in one individual, and for all subsets S of outputs, 
\[ \Pr[A(D_1) \in S] \le e^{\epsilon} \Pr[A(D_2) \in S]. \]

This definition states that changing a single individual's data in 
the database leads to a small change in the distribution on outputs. 
Unlike more standard measures of distance such as total variation 
(also called statistical difference) or Kullback-Leibler divergence, 
the metric here is multiplicative and so even very unlikely events 
must have approximately the same probability under the distributions 
$A(D_1)$ and $A(D_2)$. This condition was relaxed somewhat in 
other papers [10, 15, 6, 13, 8, 29, 25]. The schemes in all those 
papers, however, satisfy the following relaxation [13]: 



Definition 11. A randomized algorithm A is $(\epsilon, \delta)$-differentially 
private if for all databases $D_1, D_2 \in \mathcal{D}^n$ that differ in 
one individual, and for all subsets S of outputs, 
\[ \Pr[A(D_1) \in S] \le e^{\epsilon} \Pr[A(D_2) \in S] + \delta. \]

The relaxations used in [15, 6, 25] were in fact stronger (i.e., less 
relaxed) than Definition 11. One consequence of the results below 
is that all the definitions are equivalent up to polynomial changes 
in the parameters, and so, given the space constraints, we work only 
with the simplest notion. 
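
For output distributions on a finite set, both definitions above can be checked directly. The sketch below is an illustration under stated assumptions, not part of the original analysis. It relies on the fact that the worst-case subset S is the set of outputs o where p(o) exceeds $e^{\epsilon} q(o)$, so the smallest workable $\delta$ is the total excess over those outputs. The example mechanism (randomized response on a single bit, reporting the true bit with probability 0.75) and the names `smallest_delta` and `is_private` are hypothetical choices made for the example.

```python
import math

def smallest_delta(p, q, eps):
    """Smallest delta with P[S] <= e^eps * Q[S] + delta for every subset S."""
    outputs = set(p) | set(q)
    return sum(
        max(0.0, p.get(o, 0.0) - math.exp(eps) * q.get(o, 0.0))
        for o in outputs
    )

def is_private(p, q, eps, delta=0.0, tol=1e-12):
    """Check the (eps, delta) condition in both directions for one neighboring pair."""
    return (smallest_delta(p, q, eps) <= delta + tol
            and smallest_delta(q, p, eps) <= delta + tol)

# Illustrative mechanism: output distributions for two neighboring one-bit
# databases under randomized response (truth reported with probability 0.75).
out_bit0 = {0: 0.75, 1: 0.25}
out_bit1 = {0: 0.25, 1: 0.75}

print(is_private(out_bit0, out_bit1, math.log(3)))        # eps = ln 3, delta = 0
print(is_private(out_bit0, out_bit1, math.log(2)))        # eps too small for delta = 0
print(is_private(out_bit0, out_bit1, math.log(2), 0.25))  # slack absorbed by delta
```

A full check of either definition would repeat this for every pair of neighboring databases; the sketch covers a single pair only.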

4.1 Semantics of Differential Privacy 

There is a crisp, semantically-flavored interpretation of differential 
privacy, due to Dwork and McSherry, and explained in [12]: 
Regardless of external knowledge, an adversary with access to the 
sanitized database draws the same conclusions whether or not my 
data is included in the original data. (The use of the term "semantic" 
for such definitions dates back to semantic security of encryption 
[18].) 

We require a mathematical formulation of "arbitrary external 
knowledge", and of "drawing conclusions". The first is captured 
via a prior probability distribution b on $\mathcal{D}^n$ (b is a mnemonic for 
"beliefs"). Conclusions are modeled by the corresponding posterior 
distribution: given a transcript t, the adversary updates his belief 
about the database D using Bayes' rule to obtain a posterior b: 

\[ b[D \mid t] = \frac{\Pr[A(D) = t]\, b[D]}{\sum_{D'} \Pr[A(D') = t]\, b[D']} \qquad (1) \]



In an interactive scheme, the definition of A depends on the ad- 
versary's choices; for simplicity we omit the dependence on the 
adversary in the notation. Also, for simplicity, we discuss only dis- 
crete probability distributions. Our results extend directly to the 
interactive, continuous case. 

For a database D, define $D_{-i}$ to be the vector obtained by replacing 
position i by some default value in $\mathcal{D}$ (any value in $\mathcal{D}$ will do). 
This corresponds to "removing" person i's data. We consider n + 1 
related scenarios ("games", in the language of cryptography), numbered 
0 through n. In Game 0, the adversary interacts with $A(D)$. 
This is the interaction that takes place in the real world. In Game 
i (for $1 \le i \le n$), the adversary interacts with $A(D_{-i})$. Game 
i describes the hypothetical scenario where person i's data is not 
included. 

For a particular belief distribution b and transcript t, we consider 
the n + 1 corresponding posterior distributions $b_0, \ldots, b_n$. The 
posterior $b_0$ is the same as the posterior defined in Eq. (1). For larger i, the 
i-th posterior distribution $b_i$ represents the conclusions drawn in 
Game i, that is 

\[ b_i[D \mid t] = \frac{\Pr[A(D_{-i}) = t]\, b[D]}{\sum_{D'} \Pr[A(D'_{-i}) = t]\, b[D']} \]


Given a particular transcript t, privacy has been breached if there 
exists an index i such that the adversary would draw different con- 
clusions depending on whether or not i's data was used. It turns out 
that the exact measure of "different" here does not matter much. We 
chose the weakest notion that applies, namely statistical difference. 
If P and Q are probability measures on the set $\mathcal{X}$, the statistical 
difference between P and Q is defined as: 

\[ \mathrm{SD}(P, Q) = \max_{S \subset \mathcal{X}} \left| P[S] - Q[S] \right| \]

That said, some of the other relaxations, such as probabilistic differential 
privacy of [25], could lead to better parameters in Theorem 15. 




Figure 8: Effect of Number of Anonymized Releases - IPUMS Database (a) Percentage of Vulnerable Population (b) Drop in Effective Anonymity 



Definition 12. An algorithm A is $\epsilon$-semantically private if 
for all prior distributions b on $\mathcal{D}^n$, for all databases $D \in \mathcal{D}^n$, for 
all possible transcripts t, and for all $i = 1, \ldots, n$, 
\[ \mathrm{SD}\left(b_0[D \mid t],\, b_i[D \mid t]\right) \le \epsilon. \]

This can be relaxed to allow a probability $\delta$ of failure. 

Definition 13. An algorithm is $(\epsilon, \delta)$-semantically private 
if, for all prior distributions b, with probability at least $1 - \delta$ over 
pairs (D, t), where the database $D \leftarrow b$ (D is drawn according to 
b) and the transcript $t \leftarrow A(D)$ (t is drawn according to A(D)), 
for all $i = 1, \ldots, n$: 
\[ \mathrm{SD}\left(b_0[D \mid t],\, b_i[D \mid t]\right) \le \epsilon. \]
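
The quantities in these definitions can be computed exhaustively for a tiny example. The following is a sketch under explicit assumptions and should not be read as the construction used in the proofs. Databases are bit vectors with n = 2, the algorithm A is taken to be per-entry randomized response (each bit reported truthfully with probability `keep`), and the default value used to form $D_{-i}$ is 0. All function names are hypothetical, and positions are indexed from 0 in the code rather than from 1.

```python
import itertools
import math

def likelihood(db, t, keep):
    """Pr[A(db) = t] for per-entry randomized response (illustrative mechanism)."""
    return math.prod(keep if d == o else 1.0 - keep for d, o in zip(db, t))

def remove(db, i, default=0):
    """D_{-i}: replace position i by a default value."""
    return db[:i] + (default,) + db[i + 1:]

def posterior(prior, t, keep, i=None):
    """b_0[. | t] when i is None (Game 0), otherwise b_i[. | t] (Game i)."""
    weights = {
        db: likelihood(db if i is None else remove(db, i), t, keep) * pr
        for db, pr in prior.items()
    }
    total = sum(weights.values())
    return {db: w / total for db, w in weights.items()}

def statistical_difference(p, q):
    """max_S |P[S] - Q[S]|, which equals half the L1 distance on a finite set."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

def worst_case_sd(prior, n, keep):
    """Largest SD(b_0, b_i) over all transcripts t and all individuals i."""
    return max(
        statistical_difference(posterior(prior, t, keep), posterior(prior, t, keep, i))
        for t in itertools.product((0, 1), repeat=n)
        for i in range(n)
    )

# A correlated prior: the adversary believes the two entries usually agree.
n, keep = 2, 0.55
prior = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(worst_case_sd(prior, n, keep))
```

The check covers one fixed prior, whereas Definition 12 quantifies over all priors, so a small value here is evidence for this prior only.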

Dwork and McSherry proposed the notion of semantic privacy, 
informally, and observed that it is equivalent to differential privacy. 

Proposition 14 (Dwork-McSherry). $(\epsilon/2)$-differential privacy 
implies $\bar{\epsilon}$-semantic privacy, where $\bar{\epsilon} = e^{\epsilon} - 1$. 

We show that this implication holds much more generally: 

Theorem 15 (Main Result). $(\epsilon, \delta)$-differential privacy implies 
$(\epsilon', \delta')$-semantic privacy where $\epsilon' = e^{3\epsilon} - 1 + 2\sqrt{\delta}$ and 
$\delta' = O(n\sqrt{\delta})$. 



Theorem 15 states that the relaxed notions of differential privacy 
used in some previous work still imply privacy in the face of 
arbitrary side information. This is not the case for all possible relaxations, 
even very natural ones. For example, if one replaced the 
multiplicative notion of distance used in differential privacy with 
total variation distance, then the following "sanitizer" would be 
deemed private: choose an index $i \in \{1, \ldots, n\}$ uniformly at random 
and publish the entire record of individual i together with his 
or her identity (Example 2 in [14]). Such a "sanitizer" would not be 
meaningful at all, regardless of side information. 
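
The gap between the two notions of distance in this example can be checked numerically. The sketch below is illustrative only: records are hypothetical strings, the output of the "sanitizer" is modeled as the pair (index, record), and the helper names are made up for the example. Exact fractions are used so the distances are not blurred by rounding.

```python
from fractions import Fraction

def publish_random_record(db):
    """Output distribution: a uniformly random (index, record) pair."""
    n = len(db)
    return {(i, record): Fraction(1, n) for i, record in enumerate(db)}

def total_variation(p, q):
    """Half the L1 distance between two finite distributions."""
    support = set(p) | set(q)
    return sum(abs(p.get(o, 0) - q.get(o, 0)) for o in support) / 2

def max_probability_ratio(p, q):
    """Largest p(o)/q(o); infinite if p puts mass on an output q never produces."""
    return max(float("inf") if q.get(o, 0) == 0 else p[o] / q[o] for o in p)

# Two neighboring databases that differ only in individual 0's record.
n = 100
db1 = tuple(f"record-{j}" for j in range(n))
db2 = ("changed-record",) + db1[1:]

p, q = publish_random_record(db1), publish_random_record(db2)
print(total_variation(p, q), max_probability_ratio(p, q))
```

The total variation distance between neighboring outputs is only 1/n, yet the probability ratio is unbounded, so no finite $\epsilon$ satisfies Definition 10 even though one individual's full record is always published.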

Finally, the techniques used to prove Theorem 15 can also be 
used to analyze schemes which do not provide privacy for all pairs 
of neighboring databases $D_1$ and $D_2$, but rather only for most such 
pairs (neighboring databases are the ones that differ in one individual). 
Specifically, it is sufficient that those databases where the 
"indistinguishability" condition fails occur with small probability. 

Definition 16 ($(\epsilon, \delta)$-indistinguishability). Two random 
variables X, Y taking values in a set $\mathcal{X}$ are $(\epsilon, \delta)$-indistinguishable 
if for all sets $S \subseteq \mathcal{X}$, $\Pr[X \in S] \le e^{\epsilon} \Pr[Y \in S] + \delta$ and 
$\Pr[Y \in S] \le e^{\epsilon} \Pr[X \in S] + \delta$. 



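Theorem 17. Let A be a randomized algorithm. Let 
\[ \mathcal{E} = \{ D_1 \in \mathcal{D}^n : \forall \text{ neighbors } D_2 \text{ of } D_1,\ A(D_1) \text{ and } A(D_2) \text{ are } (\epsilon, \delta)\text{-indistinguishable} \}. \]
Then A satisfies $(\epsilon', \delta')$-semantic privacy for any prior distribution b such that 
$b[\mathcal{E}] = \Pr_{D_3 \leftarrow b}[D_3 \in \mathcal{E}] \ge 1 - \delta$, with $\epsilon' = e^{3\epsilon} - 1 + 2\sqrt{\delta}$ and $\delta' = O(n\sqrt{\delta})$. 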

4.2 Proof Sketch for Main Results 

The complete proofs are described in [20]. Here we sketch the 
main ideas behind both proofs. Let $Y|_{X=a}$ denote the conditional 
distribution of Y given that X = a for jointly distributed 
random variables X and Y. The following lemma (proof omitted) 
plays an important role in our proofs. 

Lemma 18 (Main Lemma). Suppose two pairs of random 
variables $(X, A(X))$ and $(Y, A'(Y))$ are $(\epsilon, \delta)$-indistinguishable 
(for some randomized algorithms A and A'). Then with probability 
at least $1 - \delta''$ over $t \leftarrow A(X)$ (equivalently $t \leftarrow A'(Y)$), the random 
variables $X|_{A(X)=t}$ and $Y|_{A'(Y)=t}$ are $(\tilde{\epsilon}, \tilde{\delta})$-indistinguishable 
with $\tilde{\epsilon} = 3\epsilon$, $\tilde{\delta} = 2\sqrt{\delta}$, and $\delta'' = O(\sqrt{\delta})$. 



Let A be a randomized algorithm (in the setting of Theorem 15, 
A is an $(\epsilon, \delta)$-differentially private algorithm). Let b be a belief 
distribution (in the setting of Theorem 17, b is a belief with $b[\mathcal{E}] \ge 1 - \delta$). 
The main idea behind both proofs is to use Lemma 18 
to show that with probability at least $1 - O(\sqrt{\delta})$ over pairs (D, t), 
where $D \leftarrow b$ and $t \leftarrow A(D)$, $\mathrm{SD}(b|_{A(D)=t},\, b|_{A(D_{-i})=t}) \le \epsilon'$. 
Taking a union bound over all coordinates i implies that with probability 
at least $1 - O(n\sqrt{\delta})$ over pairs (D, t), where $D \leftarrow b$ and $t \leftarrow A(D)$, 
for all $i = 1, \ldots, n$ we have $\mathrm{SD}(b|_{A(D)=t},\, b|_{A(D_{-i})=t}) \le \epsilon'$. 
For Theorem 17, this shows that A satisfies $(\epsilon', \delta')$-semantic 
privacy for b. In the Theorem 15 setting, where A is $(\epsilon, \delta)$-differentially 
private and b is arbitrary, it shows that $(\epsilon, \delta)$-differential privacy 
implies $(\epsilon', \delta')$-semantic privacy. 
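
One way to see where the expression for $\epsilon'$ comes from is the following short derivation. It is a sketch of a standard step and the full proofs in [20] may proceed differently; the symbols $\tilde{\epsilon}$ and $\tilde{\delta}$ are the parameters from Lemma 18.

```latex
% Suppose X and Y are (\tilde\epsilon, \tilde\delta)-indistinguishable.
% Then for every set S:
\begin{align*}
\Pr[X \in S] - \Pr[Y \in S]
  &\le \left(e^{\tilde\epsilon} - 1\right) \Pr[Y \in S] + \tilde\delta \\
  &\le e^{\tilde\epsilon} - 1 + \tilde\delta ,
\end{align*}
% and the same bound holds with X and Y exchanged, so
\[
\mathrm{SD}(X, Y) \le e^{\tilde\epsilon} - 1 + \tilde\delta .
\]
% Substituting \tilde\epsilon = 3\epsilon and \tilde\delta = 2\sqrt{\delta}:
\[
\mathrm{SD}(X, Y) \le e^{3\epsilon} - 1 + 2\sqrt{\delta} = \epsilon' .
\]
```

The union bound over the n coordinates then accounts for the factor n in $\delta' = O(n\sqrt{\delta})$.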

5. CONCLUDING REMARKS 

In this paper we explored how one can reason about privacy 
in the presence of independent anonymized releases of overlapping 
populations. Our experimental study indicates that several currently 
proposed partition-based anonymization schemes, including 
k-anonymity and its variants, are vulnerable to composition attacks. 
On the positive side, we gave a precise formulation of the property 
"resistance to arbitrary side information" and showed that several relaxations 
of differential privacy satisfy it. 



The most striking question that arises from this work is whether 
randomness in the anonymization algorithm is necessary to resist 
complex side information such as independent releases. Another 
interesting direction would be to study other settings where composition 
attacks are realistic and effective. A natural candidate for 
future investigation is the release of overlapping contingency tables, 
which are often considered in the statistical literature. 